A New Framework for Data Streams Classification
Authors
Abstract
Mining data streams has recently become an important and challenging task for a wide range of services, including credit card fraud detection, sensor networks and web applications. In these applications, data do not typically take the form of persistent relations but arrive in multiple, continuous, rapid and time-varying data streams. Hence, conventional knowledge discovery tools cannot manage this overwhelming volume of streaming data. The nature of data streams requires algorithms that make at most one pass over the data and keep track of features that evolve over time, a phenomenon known as concept drift. In the literature, several techniques for efficiently computing compact representations of massive data have been proposed to improve model reliability, including methods to cope with concept drift. The challenge for these methods is that they may use only a small amount of space and time to process each item, yet they must provide an accurate representation of the relevant characteristics of the data streams. Based on these considerations, our approach addresses the problem of data stream classification with a framework built on two main components. First, an on-line component computes and manages statistical aggregate information about the data streams. The streams are viewed as divided into chunks, and an aggregate representation, called a snapshot, is built from each stream chunk. All the statistics produced are stored in a summary structure called a frame, at different levels of granularity. Second, a mining component extracts a classifier from the snapshots produced. To maximize the number of elements from which the model is extracted, this component employs snapshots at different levels of granularity, while only the most recent data are used in the presence of concept drift.
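The on-line component described above can be pictured with a minimal sketch. The abstract does not say which statistics a snapshot holds or how the frame organizes its granularity levels, so everything here is an assumption made for illustration: the names `Snapshot`, `Frame` and `per_level` are hypothetical, a snapshot keeps additive per-class counts, sums and sums of squares, and the frame merges the two oldest snapshots of a full level into one coarser snapshot at the next level.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """Additive per-class statistics for one or more chunks (assumed layout)."""
    n_chunks: int = 0
    count: dict = field(default_factory=dict)     # label -> number of instances
    total: dict = field(default_factory=dict)     # label -> per-feature sums
    sq_total: dict = field(default_factory=dict)  # label -> per-feature sums of squares

    @classmethod
    def from_chunk(cls, chunk):
        """Build a snapshot in one pass over a chunk of (features, label) pairs."""
        snap = cls(n_chunks=1)
        for features, label in chunk:
            snap.count[label] = snap.count.get(label, 0) + 1
            sums = snap.total.setdefault(label, [0.0] * len(features))
            sqs = snap.sq_total.setdefault(label, [0.0] * len(features))
            for i, x in enumerate(features):
                sums[i] += x
                sqs[i] += x * x
        return snap

    def merge(self, other):
        """Combine two snapshots into one coarser snapshot by adding statistics."""
        merged = Snapshot(n_chunks=self.n_chunks + other.n_chunks)
        for src in (self, other):
            for label, n in src.count.items():
                width = len(src.total[label])
                merged.count[label] = merged.count.get(label, 0) + n
                sums = merged.total.setdefault(label, [0.0] * width)
                sqs = merged.sq_total.setdefault(label, [0.0] * width)
                for i in range(width):
                    sums[i] += src.total[label][i]
                    sqs[i] += src.sq_total[label][i]
        return merged


class Frame:
    """Summary structure: level k holds snapshots that each cover 2**k chunks."""

    def __init__(self, per_level=2):
        self.per_level = per_level  # hypothetical capacity of each level
        self.levels = [deque()]

    def add(self, snapshot):
        """Store a new snapshot; push merged older snapshots to coarser levels."""
        level = 0
        self.levels[0].append(snapshot)
        while len(self.levels[level]) > self.per_level:
            older = self.levels[level].popleft()
            newer = self.levels[level].popleft()
            if level + 1 == len(self.levels):
                self.levels.append(deque())
            self.levels[level + 1].append(older.merge(newer))
            level += 1
```

Because the statistics are additive, recent data stay available at fine granularity while older data survive only as coarser merged snapshots, which is one plausible reading of "different levels of granularity"; the actual frame layout in the paper may differ.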
We argue that the proposed approach avoids continuously updating a single mining model: new models are repeatedly extracted from scratch, while the system constantly updates the summarized data structures. Furthermore, the approach employs a selective ensemble classifier that combines the predictive power of models extracted from different and increasingly larger data chunks representing different portions of the data streams. Since both the performance and the accuracy of the whole system depend on the evaluation of single models, it is necessary to check the set of models that can actually be activated to classify new instances,
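The idea of a selective ensemble can be illustrated with a small sketch. The abstract is cut off before it explains how models are checked, so the selection rule below is an assumption, not the authors' method: each model is scored on a recent labeled chunk, only models reaching a hypothetical `min_accuracy` threshold are activated, and the activated models cast accuracy-weighted votes. The function names are likewise hypothetical, and models are represented as plain callables.

```python
from collections import defaultdict


def active_models(models, recent_chunk, min_accuracy=0.6):
    """Return (model, accuracy) pairs for models that pass the accuracy check.

    `recent_chunk` is a list of (features, label) pairs; the threshold is an
    illustrative assumption.
    """
    if not recent_chunk:
        return []
    selected = []
    for model in models:
        hits = sum(1 for features, label in recent_chunk if model(features) == label)
        accuracy = hits / len(recent_chunk)
        if accuracy >= min_accuracy:
            selected.append((model, accuracy))
    return selected


def ensemble_predict(selected, features):
    """Accuracy-weighted vote among the activated models; None if none is active."""
    votes = defaultdict(float)
    for model, accuracy in selected:
        votes[model(features)] += accuracy
    return max(votes, key=votes.get) if votes else None
```

Evaluating on the most recent chunk means that models built from data preceding a drift tend to fall below the threshold and drop out of the vote, while models that still fit the current concept keep contributing.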
Similar resources
Priority Setting Meets Multiple Streams: A Match to Be Further Examined?; Comment on “Introducing New Priority Setting and Resource Allocation Processes in a Canadian Healthcare Organization: A Case Study Analysis Informed by Multiple Streams Theory”
With demand for health services continuing to grow as populations age and new technologies emerge to meet health needs, healthcare policy-makers are under constant pressure to set priorities, ie, to make choices about the health services that can and cannot be funded within available resources. In a recent paper, Smith et al apply an influential policy studies framework – Kingdon’s multiple str...
A New Framework for Distributed Multivariate Feature Selection
Feature selection is considered an important issue in the classification domain. Selecting good features by the criteria of maximum relevance to the class label and minimum redundancy among features improves classification accuracy. However, most current feature selection algorithms work only in a centralized setting. In this paper, we suggest a distributed version of the mRMR featu...
Classification of encrypted traffic for applications based on statistical features
Traffic classification plays an important role in many aspects of network management, such as identifying the type of transferred data, detecting malware applications, applying policies to restrict network access and so on. Early methods in this field used obvious traffic features such as port number and protocol type to classify traffic. However, recent changes in applicat...
River System Style Frameworks, New Approach to Management of rivers (Case Study: Karaj Arangeh basin)
Brierley, G.J., Fryirs, K. (2000). River styles, a geomorphic approach to catchment characterization: implications for river rehabilitation in Bega catchment, New South Wales, Australia. Environmental Management 25, 661–679. Brierley, G.J., Fryirs, K.A. (2005). Geomorphology and River Management: Applications of the River Styles Framework. Blackwell, Oxford, UK, 298 pp. Caruso...
Streaming Multi-label Classification
This paper presents a new experimental framework for studying multi-label evolving stream classification, with efficient methods that combine the best practices in streaming scenarios with the best practices in multi-label classification. Many real world problems involve data which can be considered as multi-label data streams. Efficient methods exist for multi-label classification in non strea...
Learning from Data Streams with Concept Drift
Increasing access to incredibly large, nonstationary datasets and corresponding demands to analyse these data has led to the development of new online algorithms for performing machine learning on data streams. An important feature of real-world data streams is "concept drift," whereby the distributions underlying the data can change arbitrarily over time. The presence of concept drift in a d...
Journal title:
Volume, issue
Pages -
Publication date: 2009